Predicting Vehicle Gas Emissions

Jaspreet Kang, Faizan, Humza, Nishanth

Outline

  1. Introduction
  2. About the data
  3. Exploratory Data Analysis
  4. Modeling
  5. Conclusion

Introduction

Vehicle CO2 emissions have a significant environmental impact. CO2 released by vehicles contributes to the greenhouse effect, trapping heat in the atmosphere and exacerbating global warming. The resulting adverse effects include:

  • higher global temperatures
  • rising sea levels
  • altered weather patterns
  • disruptions to ecosystems
  • air pollution, which can harm human health and further degrade the environment.

Introduction Contd.

Reducing vehicle CO2 emissions is crucial for mitigating climate change and minimizing its harmful effects on both the environment and human well-being. This involves:

  • transitioning to cleaner and more fuel-efficient vehicles
  • promoting the adoption of electric and hybrid vehicles, and improving overall vehicle fuel efficiency
  • encouraging sustainable transportation practices

About the data

Dataset: CO2 Emissions by Vehicle Characteristics (source: Kaggle, last updated July 14, 2023)

Dimensions: 7,385 rows and 12 columns with no missing values

Description: Data on a wide array of car models and their characteristics, including but not limited to:

  • Engine Size
  • Number of Cylinders
  • Fuel Consumption
  • CO2 Emissions

Data Variables

| Variable | Variable Type | Variable Description | Data Range/Unique Categories |
|---|---|---|---|
| Make | Categorical | Brand or manufacturer of the vehicle. There are 42 unique values. | ACURA, ALFA ROMEO, ASTON MARTIN, AUDI, BENTLEY, BMW, BUGATTI, BUICK, CADILLAC,… |
| Model | Categorical | Specific model of the vehicle. There are 2,053 unique values. | 4C, A4, A4 QUATTRO, A5 CABRIOLET QUATTRO, A5 QUATTRO, A6 QUATTRO, … |
| Vehicle Class | Categorical | General category or type of vehicle based on size and purpose. There are 16 unique values. | COMPACT, FULL-SIZE, MID-SIZE, MINICOMPACT, MINIVAN, PICKUP TRUCK - SMALL,… |
| Engine Size (L) | Quantitative | Engine size in liters. | Ranges from 0.9 to 8.4 |
| Cylinders | Categorical | Number of cylinders in the vehicle's engine, ranging from 3 to 16. | 3, 4, 5, 6, 8, 10, 12, 16 |
| Transmission | Categorical | The type and number of gears in the vehicle's transmission. There are 27 unique values. | A10, A4, A5, A6, A7, A8, A9, AM5, AM6, AM7, AM8, AM9, AS10, AS4, AS5, AS6, AS7, AS8, AS9, AV, AV10, AV6, AV7, AV8, M5, M6, M7 |
| Fuel Type | Categorical | The type of fuel used by the vehicle. There are 5 unique values notated as one-letter codes: Z = Premium gasoline; D = Diesel; X = Regular gasoline; E = Ethanol (E85); N = Natural gas | Z, D, X, E, N |
| Fuel Consumption - City (L/100km) | Quantitative | The estimated fuel consumption rate for city driving conditions, measured in liters per 100 kilometers (L/100km). | Ranges from 4.2 to 30.6 |
| Fuel Consumption - Highway (L/100km) | Quantitative | The estimated fuel consumption rate for highway driving conditions, measured in liters per 100 kilometers (L/100km). | Ranges from 4.0 to 20.6 |
| Fuel Consumption - Combined (L/100km) | Quantitative | The estimated average fuel consumption rate for combined city and highway driving conditions, measured in liters per 100 kilometers (L/100km). | Ranges from 4.1 to 26.1 |
| Fuel Consumption - Combined (mpg) | Quantitative | The estimated average fuel consumption rate for combined city and highway driving conditions, measured in miles per gallon (mpg). | Ranges from 11 to 69 |
| CO2 Emissions (g/km) | Response - Quantitative | The estimated carbon dioxide emissions produced by the vehicle, measured in grams per kilometer (g/km) | Ranges from 96 to 522 |

Data Sample

Subset of Vehicle Emissions Dataset

Data Exploration

Categorical Variables - Percentage Frequency

Continuous Variables

Exploring Relationships

Scatterplot Matrix with Correlations

Observations

  • The distributions of all continuous variables are right-skewed, so transformations are needed
  • All continuous predictors correlate strongly with the response (CO2 Emissions), but some of the relationships are non-linear
  • The fuel consumption predictors carry largely the same information
    • Pairwise |correlation| > 0.9 among all of them
  • Fuel Consumption versus CO2 Emissions shows several separate, roughly parallel linear trends rather than a single line
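
The near-perfect correlation among the fuel consumption variables is expected, because the mpg column is a unit conversion of the combined L/100km column. The following is a minimal Python sketch of that conversion. It assumes the mpg figures use imperial gallons, which is not stated in the slides but is consistent with the reported ranges.

```python
# Assumption: "mpg" means miles per imperial gallon (4.54609 L), not US gallon.
LITRES_PER_IMP_GALLON = 4.54609
KM_PER_MILE = 1.609344


def l_per_100km_to_mpg(l_per_100km: float) -> float:
    """Convert fuel consumption in L/100km to miles per imperial gallon."""
    km_per_litre = 100.0 / l_per_100km
    miles_per_litre = km_per_litre / KM_PER_MILE
    return miles_per_litre * LITRES_PER_IMP_GALLON


# Endpoints of the combined L/100km range listed in the data variables table
best = l_per_100km_to_mpg(4.1)    # most efficient vehicle
worst = l_per_100km_to_mpg(26.1)  # least efficient vehicle
print(round(best), round(worst))
```

Under this assumption the two endpoints round to 69 and 11, the same as the reported mpg range, which supports keeping only one fuel consumption variable in the model.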

Closer Look at Fuel Consumption and CO2 Emissions

Multiple Linear Regression

Setup

  1. Split the data into training and testing sets
    1. 70% training, 30% testing
  2. Chose one fuel consumption variable and excluded the others
    1. Combined Fuel Consumption (mpg)
      1. Log-transformed to correct right-skewness
  3. To reduce model complexity and improve interpretability:
    1. Excluded vehicle make and vehicle model
    2. Grouped the levels of categorical variables
      1. Transmission, Class, Fuel Type, Cylinders, Engine Size
  4. Applied one-hot encoding
  5. Note: Only one vehicle uses natural gas, so that data point was removed.
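
The setup steps can be sketched in Python as follows. This is an illustrative sketch, not the code used for the analysis: the synthetic rows, the field names (`fuel_type`, `mpg`, `log_mpg`), the seed, and the choice to drop the natural gas row before splitting are all assumptions.

```python
import math
import random


def train_test_split(rows, train_frac=0.7, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


# Synthetic stand-in for the 7,385-row dataset (values are made up).
rows = [{"id": i, "fuel_type": "X", "mpg": 20 + i % 30} for i in range(7385)]
rows[0]["fuel_type"] = "N"  # stand-in for the single natural gas vehicle

# Drop the natural gas vehicle, then log-transform combined mpg.
kept = [r for r in rows if r["fuel_type"] != "N"]
for r in kept:
    r["log_mpg"] = math.log(r["mpg"])

train, test = train_test_split(kept)
print(len(train), len(test))
```

With 7,384 remaining rows, a 70% split gives 5,168 training rows, which agrees with the residual degrees of freedom reported for the fitted models plus their number of coefficients.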

Grouping Categorical Variables

| Variable | Original Categories | New Categories |
|---|---|---|
| Transmission | A4, A5, A6, A7, A8, A9, A10, M5, M6, M7, AM5, AM6, AM7, AM8, AM9, AV6, AV7, AV8, AV10, AV, AS4, AS5, AS6, AS7, AS8, AS9, AS10 | A (Automatic), M (Manual) |
| Fuel Type | D (Diesel), E (Ethanol), N (Natural Gas), X (Regular gasoline), Z (Premium Gasoline) | D, E, X/Z |
| Cylinders | 3, 4, 5, 6, 8, 10, 12, 16 | 3-5, 6, 8, 10 or more |
| Class | COMPACT, MINICOMPACT, SUBCOMPACT, SUV - SMALL, SUV - STANDARD, MID-SIZE, FULL-SIZE, STATION WAGON - SMALL, STATION WAGON - MID-SIZE, PICKUP TRUCK - STANDARD, PICKUP TRUCK - SMALL, SPECIAL PURPOSE VEHICLE, TWO-SEATER, VAN - CARGO, VAN - PASSENGER, MINIVAN | Compact, SUV, Mid/Full Size, Station Wagon, Pickup Truck, Special, Two-seater, Van |
| Engine Size | 0.9, 1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.8, 2, 2.1, 2.2, 2.3, 2.4, 2.5, 2.7, 2.8, 2.9, 3, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 4, 4.2, 4.3, 4.4, 4.6, 4.7, 4.8, 5, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 8, 8.4 | Small engines (0.9-1.5L), Midrange engines (1.6-2.9L), Large engines (3.0-4.9L), Very large engines (>=5.0L) |
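
A minimal Python sketch of the grouping rules in the table above. The function names are hypothetical, and two mappings are assumptions rather than something the slides state: every transmission code that does not start with "M" (including AM, AS, and AV) is treated as automatic, and MINIVAN is placed in the Van group.

```python
def group_transmission(code: str) -> str:
    # Assumption: only M5, M6, M7 are manual; AM/AS/AV variants count as automatic.
    return "M" if code.startswith("M") else "A"


def group_fuel_type(code: str) -> str:
    # Regular (X) and premium (Z) gasoline are merged; D and E stay as they are.
    return "X/Z" if code in ("X", "Z") else code


def group_cylinders(n: int) -> str:
    if 3 <= n <= 5:
        return "3-5"
    if n in (6, 8):
        return str(n)
    if n >= 10:
        return "10 or more"
    raise ValueError(f"unexpected cylinder count: {n}")


def group_engine_size(litres: float) -> str:
    if litres <= 1.5:
        return "Small engines"
    if litres < 3.0:
        return "Midrange engines"
    if litres < 5.0:
        return "Large engines"
    return "Very large engines"


CLASS_GROUPS = {
    "Compact": ["COMPACT", "MINICOMPACT", "SUBCOMPACT"],
    "SUV": ["SUV - SMALL", "SUV - STANDARD"],
    "Mid/Full Size": ["MID-SIZE", "FULL-SIZE"],
    "Station Wagon": ["STATION WAGON - SMALL", "STATION WAGON - MID-SIZE"],
    "Pickup Truck": ["PICKUP TRUCK - STANDARD", "PICKUP TRUCK - SMALL"],
    "Special": ["SPECIAL PURPOSE VEHICLE"],
    "Two-seater": ["TWO-SEATER"],
    # Assumption: MINIVAN is grouped with the cargo and passenger vans.
    "Van": ["VAN - CARGO", "VAN - PASSENGER", "MINIVAN"],
}
# Invert the mapping so each original class points to its new group.
CLASS_LOOKUP = {old: new for new, olds in CLASS_GROUPS.items() for old in olds}


def group_class(vehicle_class: str) -> str:
    return CLASS_LOOKUP[vehicle_class]


# Example: a 3.0 L, 6-cylinder standard SUV with an AS7 transmission on premium gasoline
grouped = (
    group_class("SUV - STANDARD"),
    group_engine_size(3.0),
    group_cylinders(6),
    group_transmission("AS7"),
    group_fuel_type("Z"),
)
print(grouped)
```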

Correlation Matrix - New Groupings

Full Model

  • Response: CO2 Emissions (g/km)

  • Predictors:

    • Engine Size (categorical)
    • Log of Combined Fuel Consumption in mpg (continuous)
    • Cylinders (categorical)
    • Transmission (categorical)
    • Fuel Type (categorical)
    • Vehicle Class (categorical)
    • Note: Each categorical variable with more than two categories was converted into multiple binary variables (one-hot encoding)
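
As an illustration of the encoding step, the Python sketch below turns one grouped vehicle into 0/1 indicator columns. The baseline level of each variable (X/Z, 3-5, A, Compact, Small engines) is inferred from which levels have no coefficient in the model summaries; the dictionary keys and column naming scheme are hypothetical.

```python
# Grouped levels; the first level in each list is treated as the baseline
# and gets no indicator column (assumption inferred from the model output).
LEVELS = {
    "fuel_type": ["X/Z", "D", "E"],
    "cylinders": ["3-5", "6", "8", "10 or more"],
    "transmission": ["A", "M"],
    "vehicle_class": ["Compact", "SUV", "Mid/Full Size", "Station Wagon",
                      "Pickup Truck", "Special", "Two-seater", "Van"],
    "engine_size": ["Small engines", "Midrange engines", "Large engines",
                    "Very large engines"],
}


def one_hot(vehicle: dict) -> dict:
    """Return 0/1 indicator columns, omitting each variable's baseline level."""
    encoded = {}
    for variable, levels in LEVELS.items():
        value = vehicle[variable]
        if value not in levels:
            raise ValueError(f"unknown {variable} level: {value!r}")
        for level in levels[1:]:
            encoded[f"{variable}={level}"] = int(value == level)
    return encoded


example = one_hot({
    "fuel_type": "X/Z",
    "cylinders": "6",
    "transmission": "A",
    "vehicle_class": "SUV",
    "engine_size": "Large engines",
})
print(len(example), sum(example.values()))
```

This yields 16 indicator columns which, together with the log of combined fuel consumption, make up the 17 predictors of the full model.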

Full Model Formula

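\(\begin{aligned}
\text{CO2 Emissions} ={}& \beta_{\text{intercept}} + \beta_{\text{fuelConsumption}}\log(\text{fuelConsumption}) \\
&+ \beta_{\text{cylinders6}}\,\text{cylinders6} + \beta_{\text{cylinders8}}\,\text{cylinders8} \\
&+ \beta_{\text{cylinders10orMore}}\,\text{cylinders10orMore} + \beta_{\text{manual}}\,\text{manual} \\
&+ \beta_{\text{Diesel}}\,\text{Diesel} + \beta_{\text{Ethanol}}\,\text{Ethanol} \\
&+ \beta_{\text{midEngine}}\,\text{midEngine} + \beta_{\text{largeEngine}}\,\text{largeEngine} \\
&+ \beta_{\text{XLEngine}}\,\text{XLEngine} \\
&+ (\text{all beta coefficients for vehicle classes}) + \epsilon
\end{aligned}\)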

Full Model - Summary

| Term | Estimate | Standard Error | t value | Pr(>\|t\|) | Signif. |
|---|---|---|---|---|---|
| (Intercept) | 967.094 | 3.707 | 260.877 | 0.0000 | *** |
| D | 30.354 | 0.819 | 37.042 | 0.0000 | *** |
| E | -81.410 | 0.641 | -127.025 | 0.0000 | *** |
| cylinders6 | 1.183 | 1.257 | 0.941 | 0.3470 | |
| `cylinders10 or more` | 36.068 | 1.713 | 21.058 | 0.0000 | *** |
| cylinders8 | 12.678 | 1.408 | 9.002 | 0.0000 | *** |
| transM | -0.356 | 0.351 | -1.013 | 0.3113 | |
| SUV | 2.569 | 0.364 | 7.056 | 0.0000 | *** |
| `Mid/Full Size` | 1.775 | 0.347 | 5.116 | 0.0000 | *** |
| `Two-seater` | 2.821 | 0.542 | 5.209 | 0.0000 | *** |
| `Station Wagon` | 1.595 | 0.636 | 2.508 | 0.0122 | * |
| Van | 17.386 | 0.890 | 19.525 | 0.0000 | *** |
| `Pickup Truck` | 4.944 | 0.520 | 9.507 | 0.0000 | *** |
| Special | 0.899 | 1.157 | 0.777 | 0.4372 | |
| `Midrange engines` | -3.557 | 0.542 | -6.565 | 0.0000 | *** |
| `Large engines` | -2.764 | 1.357 | -2.036 | 0.0418 | * |
| `Very large engines` | -0.172 | 1.489 | -0.116 | 0.9080 | |
| log(fuel_consume_comb_mpg) | -218.441 | 1.006 | -217.231 | 0.0000 | *** |

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 8.605 on 5150 degrees of freedom

Multiple R-squared: 0.9784, Adjusted R-squared: 0.9784

F-statistic: 1.375e+04 on 17 and 5150 DF, p-value: 0.0000

Model 1 - Checking Assumptions

Model 1 - Conclusions

  • Most, if not all, of the model assumptions were violated:

    • Nonconstant variance, with the residuals showing a polynomial pattern
    • Errors were not normally distributed
    • Slight curvature in the relationship between the fitted and observed values
    • Many outliers and high leverage points

Response Transformation using Box-Cox

Model 2 - Log Transformation of Response

Model 2 - Observations

  • The linearity assumption seems to be satisfied.
  • The errors are more normally distributed.
  • The variance is more stable; however, nonconstant variance remains, with the spread of the standardized residuals increasing as the fitted values increase.
  • Many high leverage points, and several outliers requiring investigation.
  • Several predictors are insignificant or collinear.

Assess Multicollinearity via VIF

| Predictor | VIF |
|---|---|
| `Large engines` | 30.428907 |
| cylinders6 | 24.526314 |
| cylinders8 | 21.224240 |
| `Very large engines` | 20.362216 |
| `cylinders10 or more` | 5.471292 |
| `Midrange engines` | 4.907806 |
| log(fuel_consume_comb_mpg) | 4.678740 |
| SUV | 1.814692 |
| `Pickup Truck` | 1.569161 |
| `Mid/Full Size` | 1.526035 |
| E | 1.384673 |
| `Two-seater` | 1.237701 |
| Van | 1.183487 |
| transM | 1.162464 |
| `Station Wagon` | 1.135989 |
| Special | 1.089745 |
| D | 1.088911 |
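
For reference, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The following is a minimal pure-Python sketch of that definition on made-up data. It is not the routine used to produce the table above, and the helper names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def r_squared(y, columns):
    """R^2 from an OLS regression of y on the given columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)] for j in range(p)]
    xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """VIF_j = 1 / (1 - R_j^2), regressing each predictor on all the others."""
    result = []
    for j, col in enumerate(columns):
        others = columns[:j] + columns[j + 1:]
        result.append(1.0 / (1.0 - r_squared(col, others)))
    return result


# Made-up predictors: x2 is almost a copy of x1, x3 is unrelated to both.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1]
x3 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
vifs = vif([x1, x2, x3])
print([round(v, 1) for v in vifs])
```

In this toy example the two nearly identical predictors get very large VIF values while the unrelated one stays near 1, the same pattern that separates the engine size and cylinder indicators from the rest of the table.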

Remove Collinear Predictors

  • Four predictors had VIF values above 20: Large engines, cylinders6, cylinders8, and Very large engines.

  • VIF values remained high after removing only one predictor, so predictors were removed in pairs.

  • Four models were tested; Model 3b, which removes cylinders6 and Very large engines, yields the highest \(R^{2}_{adj}\) value.

| Potential Model | Predictors Removed | Adjusted R Squared |
|---|---|---|
| Model 3a | cylinders6 and Large engines | 0.9954583 |
| Model 3b | cylinders6 and Very large engines | 0.9954670 |
| Model 3c | cylinders8 and Large engines | 0.9954558 |
| Model 3d | cylinders8 and Very large engines | 0.9954576 |

Check VIF Values After Removing Collinear Variables

| Predictor | VIF |
|---|---|
| log(fuel_consume_comb_mpg) | 3.842805 |
| `Large engines` | 3.243324 |
| `Midrange engines` | 3.127342 |
| cylinders8 | 2.784298 |
| `cylinders10 or more` | 1.983976 |
| SUV | 1.811370 |
| `Pickup Truck` | 1.560971 |
| `Mid/Full Size` | 1.514486 |
| E | 1.365399 |
| `Two-seater` | 1.232155 |
| Van | 1.182737 |
| transM | 1.150668 |
| `Station Wagon` | 1.134814 |
| Special | 1.085034 |
| D | 1.081219 |

Model 3b Summary

| Term | Estimate | Standard Error | t value | Pr(>\|t\|) | Signif. |
|---|---|---|---|---|---|
| (Intercept) | 8.739 | 0.006 | 1,459.746 | 0.0000 | *** |
| D | 0.139 | 0.002 | 92.259 | 0.0000 | *** |
| E | -0.349 | 0.001 | -297.342 | 0.0000 | *** |
| `cylinders10 or more` | 0.011 | 0.002 | 5.880 | 0.0000 | *** |
| cylinders8 | 0.003 | 0.001 | 3.274 | 0.0011 | ** |
| transM | -0.001 | 0.001 | -2.138 | 0.0325 | * |
| SUV | 0.003 | 0.001 | 4.746 | 0.0000 | *** |
| `Mid/Full Size` | 0.002 | 0.001 | 3.100 | 0.0019 | ** |
| `Two-seater` | 0.001 | 0.001 | 1.297 | 0.1948 | |
| `Station Wagon` | 0.002 | 0.001 | 1.339 | 0.1805 | |
| Van | 0.002 | 0.002 | 1.483 | 0.1382 | |
| `Pickup Truck` | 0.007 | 0.001 | 7.196 | 0.0000 | *** |
| Special | 0.004 | 0.002 | 1.974 | 0.0484 | * |
| `Midrange engines` | -0.000 | 0.001 | -0.411 | 0.6813 | |
| `Large engines` | 0.003 | 0.001 | 3.683 | 0.0002 | *** |
| log(fuel_consume_comb_mpg) | -0.985 | 0.002 | -586.465 | 0.0000 | *** |

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 0.01587 on 5152 degrees of freedom

Multiple R-squared: 0.9955, Adjusted R-squared: 0.9955

F-statistic: 7.565e+04 on 15 and 5152 DF, p-value: 0.0000

Model 3b Assumptions Checks

Identify Outliers

  • Three outliers with |standardized residuals| greater than 4 exist.

| make | model | class | engine_size | cylinders | transmission | fuel_type | fuel_consume | co2_emissions | stres |
|---|---|---|---|---|---|---|---|---|---|
| MERCEDES-BENZ | GL 450 4MATIC | SUV - STANDARD | 3 | 6 | AS7 | Z | 20 | 298 | -6.1 |
| MERCEDES-BENZ | B 250 | MID-SIZE | 2 | 4 | AS7 | Z | 34 | 179 | -5.0 |
| MERCEDES-BENZ | AMG CLS 53 4MATIC+ | COMPACT | 3 | 6 | A9 | Z | 26 | 235 | -4.5 |

Model 4 - Outliers Removed

Residuals Against Each Predictor

Added Variable Plots

Model 5 - Stepwise Selection using AIC

  • Stepwise selection dropped four predictors: Two-seater, Station Wagon, Van, and Midrange engines
  • The adjusted \(R^{2}\) is unchanged at 0.9955 and the ANOVA F-test is not significant (p = 0.345), so both support proceeding with the reduced model.

| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 5,153 | 1.277295 | | | | |
| 5,149 | 1.276184 | 4 | 0.001110544 | 1.120173 | 0.3449717 |
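
The F statistic in the ANOVA table can be reproduced from its other entries, and the same quantities give the AIC difference that drives stepwise selection. The Python sketch below is a worked check, not the original analysis code. The sample size n is derived as residual degrees of freedom plus the number of coefficients (5,153 + 12), and the AIC formula used is the Gaussian linear-model form up to an additive constant.

```python
import math

# Values copied from the ANOVA table
rss_reduced, df_reduced = 1.277295, 5153  # reduced model (11 predictors)
rss_full, df_full = 1.276184, 5149        # larger model (15 predictors)
sum_of_sq = 0.001110544                   # reported drop in RSS

# Partial F statistic for the 4 dropped predictors
df_diff = df_reduced - df_full
f_stat = (sum_of_sq / df_diff) / (rss_full / df_full)

# Derived, not reported: n = residual df + number of coefficients
n = df_reduced + 12


def aic(rss: float, n_coef: int) -> float:
    """AIC of a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + 2 * n_coef


delta_aic = aic(rss_reduced, 12) - aic(rss_full, 16)
print(round(f_stat, 4), round(delta_aic, 2))
```

The computed F agrees with the reported 1.120173, and the AIC difference is negative (about -3.5), meaning the reduced model is preferred by AIC as well.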

Model 5 Summary

| Term | Estimate | Standard Error | t value | Pr(>\|t\|) | Signif. |
|---|---|---|---|---|---|
| (Intercept) | 8.742 | 0.006 | 1,546.292 | 0.0000 | *** |
| D | 0.139 | 0.001 | 93.230 | 0.0000 | *** |
| E | -0.349 | 0.001 | -301.125 | 0.0000 | *** |
| `cylinders10 or more` | 0.011 | 0.002 | 6.429 | 0.0000 | *** |
| cylinders8 | 0.003 | 0.001 | 3.679 | 0.0002 | *** |
| transM | -0.002 | 0.001 | -2.452 | 0.0142 | * |
| SUV | 0.003 | 0.001 | 4.291 | 0.0000 | *** |
| `Mid/Full Size` | 0.001 | 0.001 | 2.539 | 0.0111 | * |
| `Pickup Truck` | 0.006 | 0.001 | 6.943 | 0.0000 | *** |
| Special | 0.003 | 0.002 | 1.636 | 0.1019 | |
| `Large engines` | 0.003 | 0.001 | 5.481 | 0.0000 | *** |
| log(fuel_consume_comb_mpg) | -0.986 | 0.002 | -611.911 | 0.0000 | *** |

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 0.01574 on 5153 degrees of freedom

Multiple R-squared: 0.9955, Adjusted R-squared: 0.9955

F-statistic: 1.047e+05 on 11 and 5153 DF, p-value: 0.0000

Model 5 Assumption Checks

Computing Training and Testing RMSE

| Type | RMSE |
|---|---|
| Training | 4.326869 |
| Testing | 4.519031 |
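
Because the model predicts log(CO2 Emissions), an RMSE of about 4 is presumably computed after back-transforming predictions to the original scale; the slides do not say so, but the value is far larger than the log-scale residual standard error of about 0.016. A minimal Python sketch under that assumption, with hypothetical function names and made-up numbers:

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rmse_original_scale(observed_co2, predicted_log_co2):
    """Back-transform log-scale predictions with exp() before computing RMSE."""
    predicted_co2 = [math.exp(p) for p in predicted_log_co2]
    return rmse(observed_co2, predicted_co2)


# Made-up example: two vehicles with prediction errors of 3 and -4 on the original scale
observed = [200.0, 250.0]
predicted_log = [math.log(203.0), math.log(246.0)]
value = rmse_original_scale(observed, predicted_log)
print(round(value, 3))
```

Applying the same function to the training and testing sets separately would give the two rows of the table.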

Final Model

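\(\begin{aligned}
\log(\text{CO2 Emissions}) ={}& 8.7 + 0.14\,\text{Diesel} - 0.35\,\text{Ethanol} \\
&+ 0.01\,\text{cylinders10orMore} + 0.003\,\text{cylinders8} \\
&- 0.0016\,\text{transM} + 0.0026\,\text{SUV} + 0.0015\,\text{MidFullSize} \\
&+ 0.0062\,\text{PickupTruck} + 0.0034\,\text{SpecialVehicle} \\
&+ 0.003\,\text{LargeEngines} - 0.986\log(\text{fuelConsumption})
\end{aligned}\)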
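
To show how the fitted equation is used, the Python sketch below turns a vehicle's combined mpg and indicator variables into a predicted CO2 value. It uses the three-decimal estimates from the Model 5 summary rather than the rounded intercept of 8.7, since on the log scale that rounding alone changes predictions by roughly 4%. The key `log_mpg` and the function name are hypothetical.

```python
import math

# Model 5 estimates as reported (three decimals) in the summary table
COEF = {
    "(Intercept)": 8.742,
    "D": 0.139,
    "E": -0.349,
    "cylinders10 or more": 0.011,
    "cylinders8": 0.003,
    "transM": -0.002,
    "SUV": 0.003,
    "Mid/Full Size": 0.001,
    "Pickup Truck": 0.006,
    "Special": 0.003,
    "Large engines": 0.003,
    "log_mpg": -0.986,  # coefficient of log(fuel_consume_comb_mpg)
}


def predict_co2(mpg: float, indicators=()) -> float:
    """Predict CO2 emissions (g/km) from combined mpg and active indicator terms."""
    log_co2 = COEF["(Intercept)"] + COEF["log_mpg"] * math.log(mpg)
    log_co2 += sum(COEF[name] for name in indicators)
    return math.exp(log_co2)  # back-transform from the log scale


# A mid-size, 4-cylinder, automatic, premium-gasoline vehicle rated at 34 mpg
pred = predict_co2(34, ["Mid/Full Size"])
print(round(pred, 1))
```

These inputs correspond to the B 250 row in the outlier table, whose recorded value is 179 g/km. The prediction comes out noticeably higher, in line with that vehicle's large negative standardized residual. The coefficient of nearly -1 on log mpg also means predicted emissions are close to inversely proportional to mpg.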

Conclusion

  • Training and testing RMSE were close (4.33 vs. 4.52), suggesting the model generalizes well and does not overfit.

  • While the variance of the residuals improved relative to the initial model, we could not fully correct the nonconstant variance despite trying a variety of methods.

  • Data quality concerns: We found several cases where the fuel efficiency or CO2 emission values in the dataset did not correspond with credible sources such as fueleconomy.gov.

  • Furthermore, half of the flex-fuel vehicles (FFVs) were labeled as E (Ethanol) and the other half as X/Z (Regular/Premium Gasoline), even though no other differences were observed in these vehicles' characteristics.